Bangladesh Medical Association (BMA) member data extraction

Version : 1.0
Date : 2015-05-21

This notebook illustrates the approach used to extract doctors' registration records from the BMA. Every doctor in Bangladesh receives a registration number from the BMA after successfully completing their internship. That number establishes their credibility as a doctor, and anyone can use it to verify that someone is a legitimate practitioner.

The BMA search portal only supports searching by registration number. Because doctors in this country do not routinely publish their BMA number, we need an interface that can also search the database by the doctor's name.

Tools used:

  • Python 2
  • IPython: Python module that provides a shell for interactive computing in a browser or terminal
  • Mechanize: Python module for interacting with web pages and submitting forms (Python 2 only)
  • Pandas: Python module for handling large datasets
  • Requests: simple HTTP library for Python

Unfortunately, the data on the BMA website is very bare-bones: only the doctor's name, father's name, address, and an official photo are provided for each ID number. We can still create a master table and populate it further from other sources.

This interface gives us information on about 66000 medical doctors and 4000 dentists. The country currently has around 70000 doctors, so we can expect the data to be current up to a couple of years ago.

This is a first attempt to collect and consolidate the data. Several crude hacks were used to get a working model up and running as quickly as possible. Initially the information is dumped into CSV files; once all the data has been collected, it will be imported into a PostgreSQL database.

A first use of the database might be a mobile app where a patient can search for a doctor by name or registration number and see the doctor's photo to verify that he or she is a legitimate doctor.

Extraction


In [6]:
#Load the necessary modules
from mechanize import Browser
import pandas as pd
from IPython.core.display import HTML
import requests

We need a function to parse the HTML data after extracting the result.


In [ ]:
def extract_sub_string(string, start, finish):
    """
    Extract the substring that begins at the first occurrence of 'start' and ends just before the
    first occurrence of 'finish' after that point. The 'start' marker itself is included in the result.
    Returns an empty string if either marker is not found.
    
    :param string: main string, to be parsed
    :type string: str
    
    :param start: starting string
    :type start: str
    
    :param finish: ending string
    :type finish: str
    """
    start_index = string.find(start)
    if start_index == -1:
        return ''
    end_index = string.find(finish, start_index)
    if end_index == -1:
        return ''
    return string[start_index:end_index]
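
As a quick illustration of how the helper is meant to be used, the sketch below runs it on a made-up HTML fragment. The sample markup and the name in it are assumptions chosen for the example, not the real page content, and the function is repeated in compact form only so the snippet runs on its own.

```python
def extract_sub_string(string, start, finish):
    # Compact copy of the helper above, repeated so this snippet is self-contained.
    start_index = string.find(start)
    if start_index == -1:
        return ''
    end_index = string.find(finish, start_index)
    if end_index == -1:
        return ''
    return string[start_index:end_index]

# Hypothetical fragment, not the real BMA markup.
sample = ('<div class="doctor_info"><td>Name</td><td>Jane Doe</td></div>'
          '<div>footer</div>')

# Step 1: cut out the block of interest (the start marker is kept in the result).
fragment = extract_sub_string(sample, 'doctor_info', '</div')

# Step 2: cut out one field, then keep only the text after the last '>'.
name_part = extract_sub_string(sample, '<td>Name</td>', '</td></div')
name = name_part.split('>')[-1]

print(fragment)
print(name)
```

Because the returned string still contains the start marker and any tags that follow it, the parsing code later in the notebook has to split on `>` (or a similar token) to isolate the value.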

Now we fetch the result page for each ID (1 to 66000) and store the extracted strings in a pandas DataFrame. We will tokenize the resulting strings later.


In [ ]:
start = 'doctor_info'
finish="</div"
extracted_strings = []
extracted_df = pd.DataFrame(columns=['extracted'])

for reg_no in xrange(1,66001):
    browser = Browser()
    browser.open("http://bmdc.org.bd/doctors-info/")
    # The page has 2 forms; we select the second one (nr is zero-based)
    browser.select_form(nr=1)
    # This form has 2 input fields: search_doc_id takes a number, and type indicates whether the
    # id is associated with a medical doctor or a dentist
    browser['search_doc_id']=str(reg_no)
    browser['type']=['1']
    # Submit the form and read the result
    response = browser.submit()
    content = response.read()
    str_content = str(content)
    #Extract only the relevant portion
    extracted_str = extract_sub_string(str_content, start, finish)
    extracted_strings.append(extracted_str)
    # Originally these commented-out lines were run so that each group of 100 doctors was recorded in a
    # separate csv file, for testing and stability. Each group of 100 doctors took around 6-7 minutes to record.
    #if reg_no%100==0:
    #    file_number = reg_no/100
    #    extracted_df = pd.DataFrame(columns=['extracted'])
    #    extracted_df.extracted = extracted_strings
    #    extracted_df.to_csv(str(file_number)+'.csv')
    #    extracted_strings = []
extracted_df.extracted = extracted_strings
extracted_df.to_csv('all_bma_doctor.csv')
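
Since the extraction runs for many hours, it helps to be able to resume after an interruption. The sketch below is one possible way to do that, assuming the per-batch files are named `1.csv`, `2.csv`, ... as in the commented-out lines above; the helper names `first_pending_batch` and `batch_bounds` are hypothetical and not part of the original notebook.

```python
import os

BATCH_SIZE = 100  # same batch size as the commented-out lines above

def first_pending_batch(directory, total_batches):
    """Return the number of the first batch whose CSV file is missing."""
    for batch_no in range(1, total_batches + 1):
        path = os.path.join(directory, str(batch_no) + '.csv')
        if not os.path.exists(path):
            return batch_no
    return None  # every batch is already on disk

def batch_bounds(batch_no, batch_size=BATCH_SIZE):
    """Return the first and last registration number covered by a batch."""
    first = (batch_no - 1) * batch_size + 1
    last = batch_no * batch_size
    return first, last

# 66000 registration numbers in batches of 100 gives 660 batches.
TOTAL_BATCHES = 66000 // BATCH_SIZE
```

A resumed run would call `first_pending_batch('.', TOTAL_BATCHES)` and restart the loop at the first registration number returned by `batch_bounds`. At the 6-7 minutes per 100 doctors noted above, 660 batches amount to roughly 66 to 77 hours of fetching, so an interruption somewhere along the way is likely.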

Parsing

On inspection, each nugget of information is wrapped in a specific piece of HTML. Using those patterns we can extract the relevant information.


In [ ]:
tokenized_df = pd.DataFrame(columns=['Registration','Name','Father','Address', 'Division'])

#Originally we created a number of csv files, each containing 100 doctors, so they were parsed file by file.
#file_list = []
#for item in xrange(1,66):
#    file_list.append(str(item)+'.csv')
#for file_ in file_list:
    

df = pd.read_csv('all_bma_doctor.csv')
    
for index in df.index:
        string = df.loc[index, 'extracted']

        start="Registration Number</td>\r\n"                      
        finish='</td>\r\n                                  </tr>\r\n\r\n                                  <tr class="odd">\r\n'
        reg_no = extract_sub_string(string , start, finish)
        reg_no = reg_no.strip()
        reg_no = reg_no.split(" ")[-1]
        #reg_no

        start = '<td>Doctor\'s Name</td>\r\n' 
        finish = '</td>\r\n                                  </tr>\r\n'
        dr_name = extract_sub_string(string , start, finish)
        dr_name=dr_name.strip()
        dr_name = dr_name.split(">")[-1]
        #dr_name

        start = "<td>Father's Name</td>"
        finish = "</td>\r\n                                  </tr>"
        father = extract_sub_string(string , start, finish)
        father = father.strip()
        father = father.split(">")[-1]
        #father

        start = '<td> <address> '
        finish = "</address>"
        address = extract_sub_string(string , start, finish)
        address = address.strip()
        address = address.split("<address>")[-1]
        address = address.replace("<br/>",' ').strip()
        #address

        division = 'Medical'

        values = pd.Series()
        values['Registration'] = reg_no
        values['Name'] = dr_name
        values['Father'] = father
        values['Address'] = address
        values['Division'] = division

        tokenized_df.loc[len(tokenized_df)] = values
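
The string markers above depend on the exact whitespace of the page, so they break if the markup shifts even slightly. As an alternative, here is a minimal sketch that pairs label and value cells using the standard library's HTML parser. This is not the method used in this notebook, and the sample markup, the `CellCollector` class, and `parse_label_value_rows` are assumptions made up for the illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_cell = True
            self.cells.append('')
        elif tag == 'br' and self._in_cell:
            # Treat a line break inside a cell as a space.
            self.cells[-1] += ' '

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data

def parse_label_value_rows(html):
    """Return a dict mapping each label cell to the value cell that follows it."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    # Collapse runs of whitespace (including \r\n) inside each cell.
    cells = [' '.join(cell.split()) for cell in collector.cells]
    return dict(zip(cells[0::2], cells[1::2]))

# Hypothetical fragment shaped like the label/value rows the markers above expect.
sample = (
    '<table>\r\n'
    '<tr><td>Registration Number</td>\r\n<td> 5100 </td></tr>\r\n'
    '<tr class="odd"><td>Doctor\'s Name</td>\r\n<td>Example Name</td></tr>\r\n'
    '<tr><td>Address</td><td> <address> 1 Example Road<br/>Dhaka </address> </td></tr>\r\n'
    '</table>'
)
record = parse_label_value_rows(sample)
print(record)
```

The pairing assumes every row holds exactly one label cell followed by one value cell; a page with a different layout would need a different pairing rule.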

In [17]:
tokenized_df[5000:5010]


Out[17]:
      Registration  Name                     Father  Address                               Division
5000  5100          Md. Shah Mizanur Rahman  NaN     Dist.- Pirojpur                       Medical
5001  5101          Momtaz Khanam            NaN     73 Sabaybash Dhaka                    Medical
5002  5102          Santana Chakravarty      NaN     Supanighat Dist.- Sylhet              Medical
5003  5103          Md. Masudur Rahman       NaN     Vill- Sarai Bidyapara Dist.- Rangpur  Medical
5004  5104          Md Abdus Salam           NaN     Vill- Bhalaipur Dist.- Jessore        Medical
5005  5105          Md. Abdul Wadud          NaN     47 Dhanmondi R/a Dhaka                Medical
5006  5106          Md. Abdul Wadud          NaN     47, Dhanmondi R/ A, Road No-3 Dhaka   Medical
5007  5107          A. H. M Mushihur Rahman  NaN     Vill- Shalikhan Dist.- Bogra          Medical
5008  5108          Md. Bazlur Rahman Khan   NaN     Vill- Kursatoli Dist.- Tangail        Medical
5009  5109          Feroza Begum             NaN     Eddalat Para Dist.- Patuakhali        Medical

Photo extraction

Now we have the doctors' information. We can also download the image files containing their photos.


In [15]:
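for bma_id in xrange(1,66001):
    response = requests.get('http://bmdc.org.bd/dphotos/medical/'+str(bma_id)+'.JPG')
    # Skip ids that have no photo on the server instead of saving an error page as a .jpg
    if response.status_code != 200:
        continue
    with open(str(bma_id)+'.jpg','wb') as f:
        f.write(response.content)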
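
A server can also answer with an HTML page where a photo is missing, which would end up on disk with a `.jpg` extension. One cheap safeguard, sketched here as an assumption rather than something the notebook does, is to check the first bytes of the downloaded content: JPEG data starts with the bytes FF D8 FF. The helper name `looks_like_jpeg` is hypothetical.

```python
JPEG_MAGIC = b'\xff\xd8\xff'  # every JPEG stream starts with these three bytes

def looks_like_jpeg(data):
    """Return True if the byte string starts with the JPEG signature."""
    return data[:3] == JPEG_MAGIC

# Made-up payloads standing in for downloaded content.
fake_photo = JPEG_MAGIC + b'\xe0' + b'rest of the image bytes'
fake_error_page = b'<html><body>Not Found</body></html>'

print(looks_like_jpeg(fake_photo))
print(looks_like_jpeg(fake_error_page))
```

In the download loop, the content would be written to disk only when `looks_like_jpeg` returns True for it.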

Storing into Database

Until this point, the demo work was done in Django's built-in SQLite database. Now that we have an external data source, we will populate a stand-alone database so that it can be shared between various apps.
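
To give an idea of the name search such a database would enable, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the eventual PostgreSQL database. The table name `doctor`, its columns, the `search_by_name` helper, and the sample rows are all assumptions made for the illustration. The composite key reflects the guess that medical and dental registration numbers may overlap.

```python
import sqlite3

# Made-up rows in the shape (registration, name, address, division).
rows = [
    ('5100', 'Example Doctor One', 'Dist.- Pirojpur', 'Medical'),
    ('5101', 'Example Doctor Two', 'Dhaka', 'Medical'),
]

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute(
    'CREATE TABLE doctor ('
    ' registration TEXT, name TEXT, address TEXT, division TEXT,'
    ' PRIMARY KEY (registration, division))'
)
conn.executemany('INSERT INTO doctor VALUES (?, ?, ?, ?)', rows)

def search_by_name(connection, fragment):
    """Return (registration, name, division) rows whose name contains the fragment."""
    cursor = connection.execute(
        'SELECT registration, name, division FROM doctor'
        ' WHERE name LIKE ? ORDER BY name',
        ('%' + fragment + '%',),
    )
    return cursor.fetchall()

matches = search_by_name(conn, 'two')
print(matches)
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, which suits a name search; PostgreSQL would need `ILIKE` for the same behaviour.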

To-Do

  • Complete the extraction. So far, information on around 16000 doctors has been extracted over 2 nights. Hopefully the process will be completed over the weekend.

  • Dump all the data into a database.

